{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In many domains it is common to plot scalar values as a function of time (or other single dimensions).  As long as the total number of datapoints is relatively low (in the tens of thousands, perhaps) and there are only a few separate curves involved, most plotting packages will do well.  However, for longer or more frequent sampling, you'll be required to subsample your data before plotting, potentially missing important peaks or troughs in the data.  And even just a few timeseries visualized together quickly run into highly misleading [overplotting](https://anaconda.org/jbednar/plotting_problems/notebook) issues, making the most recently plotted curves unduly prominent.\n",
    "\n",
    "For applications with many datapoints or when visualizing multiple curves, datashader provides a principled way to view *all* of your data.  In this example, we will synthesize several time-series curves so that we know their properties, and then show how datashader can reveal them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "import datashader as ds\n",
    "import datashader.transfer_functions as tf\n",
    "from collections import OrderedDict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create some fake timeseries data\n",
    "\n",
    "Here we create a fake time series signal, then generate many noisy versions of it.  We will also add a couple of \"rogue\" lines, with different statistical properties, and see how well those stand out from the rest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Constants\n",
    "np.random.seed(2)\n",
    "n = 100000                           # Number of points\n",
    "cols = list('abcdefg')               # Column names of samples\n",
    "start = 1456297053                   # Start time\n",
    "end = start + 60 * 60 * 24           # End time   \n",
    "\n",
    "# Generate a fake signal\n",
    "time = np.linspace(start, end, n)\n",
    "signal = np.random.normal(0, 0.3, size=n).cumsum() + 50\n",
    "\n",
    "# Generate many noisy samples from the signal\n",
    "noise = lambda var, bias, n: np.random.normal(bias, var, n)\n",
    "data = {c: signal + noise(1, 10*(np.random.random() - 0.5), n) for c in cols}\n",
    "\n",
    "# Add some \"rogue lines\" that differ from the rest \n",
    "cols += ['x'] ; data['x'] = signal + np.random.normal(0, 0.02, size=n).cumsum() # Gradually diverges\n",
    "cols += ['y'] ; data['y'] = signal + noise(1, 20*(np.random.random() - 0.5), n) # Much noisier\n",
    "cols += ['z'] ; data['z'] = signal # No noise at all\n",
    "\n",
    "# Pick a few samples from the first line and really blow them out\n",
    "locs = np.random.choice(n, 10)\n",
    "data['a'][locs] *= 2\n",
    "\n",
    "# Default plot ranges:\n",
    "x_range = (start, end)\n",
    "y_range = (1.2*signal.min(), 1.2*signal.max())\n",
    "\n",
    "print(\"x_range: {0} y_range: {0}\".format(x_range,y_range))\n",
    "\n",
    "# Create a dataframe\n",
    "data['Time'] = np.linspace(start, end, n)\n",
    "df = pd.DataFrame(data)\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting *all* the datapoints\n",
    "\n",
    "With datashader, we can plot *all* the datapoints for a given timeseries.  Let's select the first curve 'a' and draw it into an aggregate grid, connecting each datapoint in the series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=300, plot_width=900)\n",
    "aggs= OrderedDict((c, cvs.line(df, 'Time', c)) for c in cols)\n",
    "img = tf.shade(aggs['a'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "img"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result looks similar to what you might find in any plotting program, but it uses all 100,000 datapoints, and would work similarly for 1, 10, or 100 million points (determined by the `n` attribute above).  \n",
    "\n",
    "Why is using all the datapoints important? To see, let's downsample the data by a factor of 10, plotting 10,000 datapoints for the same curve:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sampling = 10\n",
    "df2 = pd.DataFrame({'a':data['a'][::sampling], 'Time':np.linspace(start, end, int(n/sampling))})\n",
    "tf.shade(cvs.line(df2, 'Time', 'a'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The resulting plot is similar, but now none of the \"blown up\" datapoints (sharp spikes) that were clearly visible in the first version are visible at all!  They didn't happen to be amongst the sampled points, and thus do not show up in this plot, which should never be a problem using datashader with all the points.\n",
    "\n",
    "\n",
    "### Overplotting problems\n",
    "\n",
    "What happens if we then overlay multiple such curves?  In a traditional plotting program, there would be serious issues with overplotting, because these curves are highly overlapping.  To show what would typically happen, let's merge the images corresponding to each of the curves:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "renamed = [aggs[key].rename({key: 'value'}) for key in aggs]\n",
    "merged = xr.concat(renamed, 'cols')\n",
    "tf.shade(merged.any(dim='cols'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `any` operator merges all the data such that any pixel that is lit up for any curve is lit up in the final result.  Clearly, it is difficult to see any structure in this fully overplotted data; all we can see is the envelope of these curves, i.e. the minimum and maximum value of any curve for any given time point. It remains completely unclear how the various curves in the set relate to each other.  Here we know that we put in one particularly noisy curve, which presumably determines the envelope, but there's no way to tell that from the plot.\n",
    "\n",
    "We can of course try giving each curve a different color:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "colors = [\"red\", \"grey\", \"black\", \"purple\", \"pink\",\n",
    "          \"yellow\", \"brown\", \"green\", \"orange\", \"blue\"]\n",
    "imgs = [tf.shade(aggs[i], cmap=[c]) for i, c in zip(cols, colors)]\n",
    "tf.stack(*imgs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But that doesn't help much; there are 10 curves, but only three or four colors are visible, due to overplotting.  Problems like that will just get much worse if there are 100, 1000, or 1 million curves.  Moreover, the results will look entirely different if we plot them in the opposite order:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf.stack(*reversed(imgs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Having the visualization look completely different for an arbitrary choice like the plotting order is a serious problem if we want to understand the properties of this data, from the data itself.\n",
    "\n",
    "### Trends and outliers\n",
    "\n",
    "So, what might we be interested in understanding when plotting many curves?  One possibility is the combination of (a) the overall trends, and (b) any curves (and individual datapoints) that differ from those trends.  \n",
    "\n",
    "To look at the trends, we should combine the plots not by overplotting as in each of the above examples, but using operators that accurately reveal overlap between the curves.  When doing so, we won't try to discern individual curves directly by assigning them unique colors, but instead try to show areas of the curves that establish the trends and differ from them.  (Assigning colors per curve could be done as for the racial categories in census.ipynb, but that won't be further investigated here.)\n",
    "\n",
    "Instead of the `.any()` operator above that resulted in complete overplotting, or the `tf.stack` operator that depended strongly on the plotting order, let's use the `.sum()` operator to reveal the full patterns of overlap arithmetically:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "total = tf.shade(merged.sum(dim='cols'), how='linear')\n",
    "total"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With study, the overall structure of this dataset should be clear, according to what we know we put in when we created them:\n",
    "\n",
    "1. Individual rogue datapoints from curve 'a' are clearly visible (the seven sharp spikes)\n",
    "2. The trend is clearly visible (for the viridis colormap, the darkest greens show the areas of highest overlap)\n",
    "3. Line 'x' that gradually diverges from the trend is clearly visible (as the light blue (low-count) areas that increase below the right half of the plot).\n",
    "\n",
    "(Note that if you change the random seed or the number of datapoints, the specific values and locations will differ from those mentioned in the text.)\n",
    "\n",
    "None of these observations would have been possible with downsampled, overplotted curves as would be typical of other plotting approaches.\n",
    "\n",
    "\n",
    "### Highlighting specific curves\n",
    "\n",
    "The data set also includes a couple of traces that are difficult to detect in the `.sum()` plot above, one with no noise and one with much higher noise.  One way to detect such issues is to highlight each of the curves in turn, and display it in relation to the datashaded average values.  For instance, those two curves (each highlighted in red below) stand out against the pack as having less or more noise than is typical:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf.stack(total, tf.shade(aggs['z'], cmap=[\"lightblue\", \"red\"]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf.stack(total, tf.shade(aggs['y'], cmap=[\"lightblue\", \"red\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dynamic Plots\n",
    "\n",
    "In practice, it might be difficult to cycle through each of the curves to find one that's different, as done above.  Perhaps a criterion based on similarity could be devised, choosing the curve most dissimilar from the rest to plot in this way, which would be an interesting topic for future research.  In any case, one thing that can always be done is to make the plot fully interactive, so that the viewer can zoom in and discover such patterns dynamically.\n",
    "If you are looking at a live, running version of this notebook, just enable the wheel zoom or box zoom tools, and then zoom and pan as you wish:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bokeh.models import BoxZoomTool, DatetimeTickFormatter\n",
    "import bokeh.plotting as bp\n",
    "bp.output_notebook()\n",
    "\n",
    "formatter = DatetimeTickFormatter(microseconds=['%fus', '%H:%M %3fus', '%F %H:%M %3fus'],\n",
    "        milliseconds=['%S.%3Ns', '%H:%M %3Nms', '%F %H:%M %3Nms'],\n",
    "        seconds=['%Ss'],# '%H:%M:%S'],\n",
    "        minsec=[':%M:%S', '%H:%M:%S'],\n",
    "        hours=['%H:%M', '%F %H:%M'],\n",
    "        days=['%F'])\n",
    "\n",
    "def base_plot(tools='pan,wheel_zoom,reset'):\n",
    "    p = bp.figure(tools=tools, plot_width=600, plot_height=300,\n",
    "        x_range=x_range, y_range=y_range, outline_line_color=None,\n",
    "        min_border=0, min_border_left=0, min_border_right=0,\n",
    "        min_border_top=0, min_border_bottom=0)   \n",
    "    p.xgrid.grid_line_color = None\n",
    "    p.ygrid.grid_line_color = None\n",
    "    p.xaxis.formatter = formatter\n",
    "    p.add_tools(BoxZoomTool(match_aspect=True))\n",
    "    return p"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datashader.bokeh_ext import InteractiveImage\n",
    "def create_image(x_range, y_range, w, h):\n",
    "    cvs = ds.Canvas(x_range=x_range, y_range=y_range,\n",
    "                    plot_height=h, plot_width=w)\n",
    "    aggs = OrderedDict((c,cvs.line(df, 'Time', c, agg=ds.count())) for c in cols)\n",
    "    renamed = [aggs[key].rename({key: 'value'}) for key in aggs]\n",
    "    merged = xr.concat(renamed, 'cols')\n",
    "    total = merged.sum(dim='cols')\n",
    "    img = tf.shade(total, how='linear')\n",
    "    return img\n",
    "    \n",
    "p = base_plot()\n",
    "InteractiveImage(p, create_image)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here the diverging \"rogue line\" is immediately apparent, and if you zoom in towards the right you can see precisely how it differs from the rest.  The low-noise \"rogue line\" is much harder to see, but if you zoom in enough (particularly if you stretch out the x axis by zooming on the axis itself), you can see that one line goes through the middle of the pack, with different properties from the rest.  The datashader team is working on support for hover-tool information to reveal what line that was, and in general on better support for exploring large timeseries (and other curve) data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multiple trends\n",
    "\n",
    "The above curves were very highly overlapping to illustrate a worst-case scenario, but now let's look at a case with curves that diverge more strongly from each other.   For instance, imagine that we have a simulation where one of three qualitatively different decisions are made at the starting time (perhaps three different parameter settings), along with noisy samples from each group, and we want to see what effect that has on the overall expected state into the future.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's create time series in three different groups, each of which are similar to each other, but each group differing from each other as they diverge from a common starting point.  For instance, in a simulation run, each group could be from setting a different parameter, while each noisy version could be different runs with a different seed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "signals = [np.random.normal(0, 0.3, size=n).cumsum() + 50,\n",
    "           np.random.normal(0, 0.3, size=n).cumsum() + 50,\n",
    "           np.random.normal(0, 0.3, size=n).cumsum() + 50]\n",
    "data = {c: signals[i%3] + noise(1+i, 5*(np.random.random() - 0.5), n)  for (i,c) in enumerate(cols)}\n",
    "y_range = (1.2*min([s.min() for s in signals]), 1.2*max([s.max() for s in signals]))    \n",
    "\n",
    "data['Time'] = np.linspace(start, end, n)\n",
    "df = pd.DataFrame(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And let's examine the result interactively:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_image3(x_range, y_range, w, h, df=df):\n",
    "    cvs = ds.Canvas(x_range=x_range, y_range=y_range,\n",
    "                    plot_height=h, plot_width=w)\n",
    "    aggs = [cvs.line(df, 'Time', c, agg=ds.count()).rename({c: 'value'}) for c in cols]\n",
    "    merged = xr.concat(aggs, 'cols')\n",
    "    total = merged.sum(dim='cols')\n",
    "    image = tf.shade(total, how='linear')\n",
    "    return tf.dynspread(image)\n",
    "\n",
    "p2 = base_plot()\n",
    "InteractiveImage(p2, create_image3, df=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here the three groups can clearly be seen, at least once they diverge sufficiently, as well as the areas of high overlap (high probability of being in that state at that time). Additional patterns are visible when zooming in, all the way down to the individual datapoints, and again it may be useful to zoom first on the x axis (to make enough room on your screen to distinguish the datapoints, since there are 100,000 of them and only a few thousand pixels at most on your screen!). And note again that the interactive plots require a running server if you want to see more datapoints as you zoom in; static exports of this notebook won't support full zooming."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Plotting large numbers of time series together\n",
    "\n",
    "The examples above all used a small number of very long time series, which is one important use case for Datashader.  Another important use case is visualizing very large numbers of time series, even if each individual curve is relatively short. If you have hundreds of thousands of timeseries, putting each one into a Pandas dataframe column and aggregating it individually will not be very efficient. \n",
    "\n",
    "Luckily, Datashader can render arbitrarily many separate curves, limited only by what you can fit into a Dask dataframe (which in turn is limited only by your system's total disk storage). Instead of having a dataframe with one column per curve, you would instead use a single column for 'x' and one for 'y', with an extra row containing a NaN value to separate each curve from its neighbor (so that no line will connect between them). In this way you can plot millions or billions of curves efficiently.\n",
    "\n",
    "To make it simpler to construct such a dataframe for the special case of having multiple time series of the same length, Datashader includes a utility function accepting a 2D Numpy array and returning a NaN-separated dataframe. (See [datashader issue 286](https://github.com/bokeh/datashader/issues/286#issuecomment-334619499) for background.) \n",
    "\n",
    "As an example, let's generate 100,000 sequences, each with 10 points, as a Numpy array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 100000\n",
    "points = 10\n",
    "data = np.random.normal(0, 100, size = (n, points))\n",
    "data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can create a suitable Datashader-compatible tidy dataframe using the utility:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = ds.utils.dataframe_from_multiple_sequences(np.arange(points), data)\n",
    "df.head(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And then render it as usual:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cvs = ds.Canvas(plot_height=400, plot_width=1000)\n",
    "agg = cvs.line(df, 'x', 'y', ds.count())   \n",
    "img = tf.shade(agg, how='eq_hist')\n",
    "img"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, the 10 high and low peaks each represent one of the 10 values in each sequence, with lines connecting those random values to the next one in the sequence.  Thanks to the `eq-hist` colorization, you can see subtle differences in the likelihood of any particular pixel being crossed by these line segments, with the values towards the middle of each gap most heavily crossed as you would expect.  You'll see a similar plot for 1,000,000 or 10,000,000 curves, and much more interesting plots if you have real data to show!"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
